Method for automatically matching
graphic elements and phonetic elements 

REFERENCE TO RELATED APPLICATION 

This application is a continuation of PCT International
Application No. PCT/FR2004/03278, filed December 11, 2004,
which is based on French Application No. 0314928, filed
December 18, 2003.

BACKGROUND OF THE INVENTION 

1 - Field of the Invention 

The present invention relates generally to the
automatic extraction of linguistic knowledge from a corpus
of transcriptions of graphic chains into phonetic chains.
It relates more particularly to the transcription of
typographic elements, such as characters in a
predetermined language, into phonetic elements.

2 - Description of the Prior Art 

At present, each word of a language constitutes a
graphic chain that is transcribed phonetically into a
chain of phonemes by a phonetician. For any new word to
be added to a training corpus, the phonetician must
intervene to transcribe the new word phonetically. Thus
the training corpus furnishes only global
grapheme/phoneme transcriptions. For example, in the
global transcription "ruelle"/[ryɛl], the corpus
indicates only that the graphic chain "ruelle" as a whole
is translated into a phonetic chain. It does not make
explicit that the typographic element "r" is
transcribed phonetically in some unitary way. Nor does
the global transcription indicate the syllables
or graphemes constituting the graphic chain or the
phonetic elements constituting the phonetic chain.

One or more phonetic chains associated with any
graphic chain can be determined from the known elementary
transcription of each typographic element by
character-by-character analysis of the graphic chain.
Error-correction systems find such phonetic
transcriptions useful for recognizing lexical errors in
text entered on a keyboard. There is therefore a need to
extract more refined elementary transcriptions from a raw
transcription.

OBJECT OF THE INVENTION 

The invention aims to derive automatically from raw 
transcriptions of graphic chains, for example words and 
family names, into phonetic chains, transcriptions of 
graphic elements, for example characters, into phonetic 
elements constituting the phonetic chains, in order to 
segment any graphic chain into graphemes and any phonetic 
chain into phonemes automatically. The graphic element by 
graphic element, i.e. character by character, elementary 
transcriptions thereafter facilitate automatic global 
transcription of any additional graphic chain added to 
the corpus of graphic chains, in particular on the basis 
of a concatenation of phonetic elements matched one to
one to the characters of the additional graphic
chain.

SUMMARY OF THE INVENTION

Accordingly, a method of the invention matches 
graphic elements constituting given graphic chains 
automatically to phonetic elements constituting 
corresponding phonetic chains after initially entering 
global transcriptions of the graphic chains into the 
phonetic chains into a database accessible by the 
computer and after estimating and storing in the database 
first probabilities of elementary transcriptions of 
graphic elements into respective phonetic elements. The 
method is characterized by the following steps: 

for each transcription of a given graphic chain
with M graphic elements into a corresponding phonetic
chain with N phonetic elements, determining by MxN
iterations second probabilities of MxN second
transcriptions of M graphic chains resulting from M
successive concatenations of 1 to M graphic
elements into N phonetic chains resulting from N
successive concatenations of 1 to N phonetic
elements, each second probability of a second
transcription depending on a previously estimated first
probability of the last graphic and phonetic elements of
said second transcription and depending on the highest of
three respective second probabilities determined by
preceding iterations, M and N being integers, and

establishing and storing a link between the last
elements of the graphic and phonetic chains of each
second transcription and the last elements of the graphic
and phonetic chains of the transcription relating to the
highest of the three respective second probabilities, in
order for links established in an MxN matrix relative to
the second probabilities to constitute a single path
between last and first pairs of graphic and phonetic
elements of the matrix, in order to segment the given
graphic chain into graphemes corresponding to respective
phonemes segmenting the corresponding phonetic chain and
to store the matches between the graphemes and phonemes
in the database, the number of graphic elements in a
grapheme being identical to the number of phonetic
elements in the corresponding phoneme, in order for any
new graphic chain to be transcribed automatically into a
phonetic chain segmented into phonemes by means of the
stored matches.

According to other features of the invention, the
respective first probability for the determination of a
second probability relating to a second transcription of
a graphic chain concatenating m graphic elements into a
phonetic chain concatenating n phonetic elements, with
$1 \le m \le M$ and $1 \le n \le N$, relates to the last
elements in the graphic chain with m graphic elements and
the phonetic chain with n phonetic elements. The three
respective second probabilities determined beforehand for
the second transcription of the graphic chain with m
graphic elements into the phonetic chain with n phonetic
elements preferably and respectively relate to a second
transcription of a graphic chain with m-1 graphic
elements into the phonetic chain with n phonetic
elements, a second transcription of the graphic chain
with m graphic elements into a phonetic chain with n-1
phonetic elements and a second transcription of the
graphic chain with m-1 graphic elements into the phonetic
chain with n-1 phonetic elements.

For example, from the corpus of global transcriptions
such as "ruelle"/[ryɛl], the invention transcribes
phonetically the graphic elements "r", "u", "e", "lle"
into the respective phonetic elements [r], [y], [ɛ], [l].

The invention may be regarded as similar to a
process of syllabification which, by analysis, decomposes
a global transcription into elementary transcriptions and
locally matches grapheme/phoneme subtranscriptions. The
division into initial graphemes and phonemes and the
biunivocal matching of each graphic element to each
phonetic element of the divided phonemes is called
grapheme/phoneme alignment. In the above example, the
invention produces the following alignment:

"r"  "u"  "e"  "lle"
[r]  [y]  [ɛ]  [l**].

The symbol * denotes a mute and meaningless phonetic
element.

BRIEF DESCRIPTION OF THE DRAWINGS 

Other features and advantages of the present
invention will become more clearly apparent on
reading the following description of preferred
embodiments of the invention, given by way of nonlimiting
examples and with reference to the corresponding appended
drawings, in which:

- FIG. 1 shows an algorithm of the main steps of
the automatic matching method of the invention; and

- FIG. 2 shows an algorithm of the substeps of a
step of the automatic matching method for determining
individual first probabilities.

DETAILED DESCRIPTION OF THE DRAWINGS

As shown in FIG. 1, the method of the invention for
automatically matching graphic elements and phonetic
elements comprises main steps E1 to E11. For example,
those steps are for the most part implemented in the form
of software in a terminal, such as a personal computer or
a mobile terminal in a cellular radio communication
network, and linked in particular to a software system
for orthographic correction of lexical errors which is
insertable into a word processing system or a linguistic
practice system. The terminal contains or is able to
access a database of the type used in artificial
intelligence. The database stores a corpus C of initial
global transcriptions.

Initially, in the step E1, the global
transcriptions (CG|CP) are constituted by pairs each
matching a graphic chain CG, such as a word in a
predetermined language or a family name, to a phonetic
chain CP. These transcriptions are determined and entered
by an expert in phonetics on a form displayed by the
computer. The corpus C matches a priori graphic chains
CG, each composed of one or more typographic elements
(characters), hereinafter called graphic elements $g_i$
of an alphabet $G = \{g_1, \ldots, g_I\}$ with I elements
in the predetermined language, where $1 \le i \le I$, to
respective phonetic chains CP, each composed of one or
more phonetic elements $p_j$ of an alphabet
$P = \{p_1, \ldots, p_J\}$ with J phonetic elements, where
$1 \le j \le J$ and $I \ne J$. However, the segmentation
of the chain CG into syllables or into graphemes each
comprising one or more graphic elements and the
segmentation of the chain CP into phonemes each
comprising one or more phonetic elements are unknown at
this stage.

The alphabets G and P typically comprise around 30
elements each. There are therefore about 30 x 30 = 900
possible pairs of graphic elements and phonetic elements.
In practice, the corpus C contains at least 100,000
global transcriptions of graphic chains CG into
phonetic chains CP, which protects the invention from
coarse errors in the estimation of probabilities, as
discussed below.

In the step E2, first probabilities of elementary
transcription $P(g_i|p_j)$, i.e. the probabilities that a
graphic element $g_i$ matches the phonetic element $p_j$,
are first estimated and stored in the database with the
corpus C of global transcriptions.

The estimated values of the first probabilities should be
as close as possible to the respective maximum
probability values, so that the method of the
invention, which operates by iterations, converges
quickly without settling on local maxima.

The concatenated nature of the global
transcriptions of the chains leads to the hypothesis of a
correlation between the rank $r_g$ of the graphic elements
in a graphic chain CG and the rank $r_p$ of the phonetic
elements in the corresponding phonetic chain CP. For
example, in the global transcription (beau|bo), it is
more probable that the graphic element b, given its
position at the beginning of the chain CG, translates to
the phonetic element [b] rather than to the phonetic
element [o] placed at the end of the corresponding chain
CP. In this example, the correlation of the ranks brings
the graphic elements b and e closer to the phonetic
element [b] and the graphic elements a and u closer to
the phonetic element [o].

The algorithm for the initial estimation E2 of the
first probabilities $P(g_i|p_j)$ comprises the following
substeps E21 to E27.

In the substep E21, I x J contingency numbers
$K_{g_i p_j}$, respectively associated with the elementary
transcriptions $(g_i|p_j)$ of a graphic element of the
alphabet G into a phonetic element of the alphabet P, are
set to zero. The contingency number $K_{g_i p_j}$ is equal
at the end of the step E2 to the estimated number of times
that the graphic element $g_i$ is transcribed into the
phonetic element $p_j$ in the various global
transcriptions of graphic chains CG into phonetic chains
CP included in the corpus C.

For each chain transcription (CG|CP), as indicated
in the substep E22, the ranks of the graphic elements in
the chain CG and the ranks of the phonetic elements in
the chain CP are normalized as a function of the
respective lengths $l_g$ and $l_p$ of the chains CG and
CP, which may be different. In the substep E23, the rank
r of a phonetic element in the chain CP is derived from
the rank $r_{g_i}$ of a graphic element $g_i$ in the chain
CG with which the phonetic element of rank r will be
associated, in accordance with the following
relationship, where $\lfloor \cdot \rfloor$ denotes the
integer portion:

$$r = \left\lfloor r_{g_i} \, l_p / l_g \right\rfloor .$$

The contingency number $K_{g_i p_j}$ associated with
the elementary transcription of the graphic element $g_i$
into the phonetic element $p_j$ is then incremented by 1
only if the phonetic element $p_j$ is situated at the
derived rank r in the chain CP, as indicated in the
substeps E24 and E25.

The substeps E22 to E25 are repeated for each
global transcription (CG|CP) of the corpus C, as
indicated in the substep E26. When all the global
transcriptions of the corpus have been processed, the
next substep E27 estimates all the first probabilities
$P(g_i|p_j)$ of elementary transcription between the
graphic elements and the phonetic elements, in accordance
with the following relationship for each graphic element
$g_i$:

$$P(g_i \mid p_j) = \frac{K_{g_i p_j}}{\sum_{j'=1}^{J} K_{g_i p_{j'}}}$$

after calculating the sum term in the denominator for the
graphic element $g_i$.
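
The substeps E21 to E27 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation of the invention: the function name `estimate_first_probabilities` and the nested-dictionary layout are hypothetical, and ranks are counted from zero so that the derived rank r always falls inside the chain CP.

```python
from collections import defaultdict


def estimate_first_probabilities(corpus):
    """Estimate P(g|p) from global transcriptions (substeps E21 to E27).

    corpus: iterable of (graphic_chain, phonetic_chain) pairs, each chain
    being a string or a sequence of symbols.
    Returns a dict: graphic element -> {phonetic element: probability}.
    """
    # E21: every contingency number K[g][p] starts at zero.
    counts = defaultdict(lambda: defaultdict(int))

    # E26: repeat E22 to E25 for each global transcription of the corpus.
    for cg, cp in corpus:
        lg, lp = len(cg), len(cp)
        if lg == 0 or lp == 0:
            continue  # nothing to count for an empty chain
        for rg, g in enumerate(cg):      # rg is a 0-based rank (assumption)
            r = (rg * lp) // lg          # E22-E23: integer portion of rg*lp/lg
            counts[g][cp[r]] += 1        # E24-E25: count the element at rank r

    # E27: normalise the contingency numbers of each graphic element.
    probs = {}
    for g, row in counts.items():
        total = sum(row.values())
        probs[g] = {p: k / total for p, k in row.items()}
    return probs
```

With 0-based ranks, the transcription (beau|bo) associates b and e with [b] and a and u with [o], in line with the rank correlation described for this example.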
Referring again to FIG. 1, the matching process
continues with steps E3 to E10, which segment each graphic
chain CG read in the corpus of the database in order to
match automatically and on a biunivocal basis each
segment of the chain CG, called a grapheme and comprising
one or more graphic elements, to a segment, called a
phoneme and comprising one or more phonetic elements,
resulting from segmentation of the corresponding phonetic
chain CP.

A graphic chain CG comprises M consecutive graphic
elements $g_1$ to $g_M$ and the phonetic chain CP
corresponding to the chain CG comprises N consecutive
phonetic elements $p_1$ to $p_N$. The integer N may be
different from or equal to the integer M.

The probability
$P(g_1 \ldots g_m \ldots g_M \mid p_1 \ldots p_n \ldots p_N)$
that the chain CG matches the chain CP, where
$1 \le m \le M$ and $1 \le n \le N$, is determined as a
function of the first elementary transcription
probabilities $P(g_i|p_j)$ estimated and stored beforehand
in the step E2 and from the similarity between the chains
CG and CP. The similarity is based on the
Damerau-Levenshtein Metric (DLM) but using maximization
instead of minimization. The probability P(CG|CP) is
determined by dynamic programming using the following
iterative formula for any pair m, n such that
$1 \le n \le N$ and $1 \le m \le M$:

$$P(g_1 g_2 \ldots g_m \mid p_1 p_2 \ldots p_n) = P(g_m \mid p_n) \cdot \max\bigl[\, P(g_1 \ldots g_{m-1} \mid p_1 \ldots p_n),\; P(g_1 \ldots g_m \mid p_1 \ldots p_{n-1}),\; P(g_1 \ldots g_{m-1} \mid p_1 \ldots p_{n-1}) \,\bigr].$$

The concatenated nature of the global chain
transcriptions and the grapheme/phoneme transcriptions
means that Markov models may be applied efficaciously.
For the given probability of transcription of a chain
$g_1 g_2 \ldots g_m$ into a chain $p_1 p_2 \ldots p_n$, the
extension of the graphic, respectively phonetic, chain by
a new graphic element $g_{m+1}$, respectively phonetic
element $p_{n+1}$, gives rise either to the same phonetic
chain, respectively graphic chain, or to the addition of
a new phonetic element, respectively graphic element.
Expressed in terms of probability,
$P(g_1 g_2 \ldots g_{m+1} \mid p_1 p_2 \ldots p_{n+1})$
depends only on the probabilities of three possible
transcriptions:

$$P(g_1 g_2 \ldots g_m \mid p_1 p_2 \ldots p_{n+1}),$$

$$P(g_1 g_2 \ldots g_{m+1} \mid p_1 p_2 \ldots p_n),$$

$$P(g_1 g_2 \ldots g_m \mid p_1 p_2 \ldots p_n).$$

That dependency is expressed by the DLM metric
equal to the highest of the above three possibilities.

After setting the indices m and n to zero for a
global transcription (CG|CP) in the step E3 and
incrementing the indices m and n by 1 in the steps E4 and
E5, iterations in the steps E6 and E7 begin by
determining the probabilities that the M successive
concatenations of the graphic elements $g_1$ to $g_M$ of
the chain CG match the first phonetic element $p_1$ of the
chain CP, i.e.:

$$P(g_1 \ldots g_m \mid p_1) = P(g_m \mid p_1) \cdot \max\bigl[ P(g_1 \ldots g_{m-1} \mid p_1) \bigr]$$

where $1 \le m \le M$, starting with the elementary
probability $P(g_1|p_1)$. As shown by the step E8, the
process then determines by iteration the probabilities of
the M concatenations of the graphic elements $g_1$ to
$g_M$ of the chain CG matching the first two phonetic
elements $p_1$ and $p_2$ of the chain CP, using the
probabilities previously determined for the first
phonetic element $p_1$, i.e.:

$$P(g_1 \ldots g_m \mid p_1 p_2) = P(g_m \mid p_2) \cdot \max\bigl[\, P(g_1 \ldots g_{m-1} \mid p_1 p_2),\; P(g_1 \ldots g_m \mid p_1),\; P(g_1 \ldots g_{m-1} \mid p_1) \,\bigr].$$

The process then continues by adding a phonetic
element $p_n$ to determine the M probabilities
$P(g_1 \mid p_1 \ldots p_n)$ to
$P(g_1 \ldots g_M \mid p_1 \ldots p_n)$, up to the M
probabilities relating to the chain
CP = $(p_1, \ldots, p_N)$. By iteration of the steps E4 to
E8, the computer progressively constructs and stores a
matrix of second probabilities
$P(g_1 \ldots g_m \mid p_1 \ldots p_n)$ with M columns for
successive concatenations of the M graphic elements and N
rows for successive concatenations of the N phonetic
elements, operating row by row as in the above example,
beginning with the probability $P(g_1|p_1)$ and ending
with the probability
$P(g_1 \ldots g_M \mid p_1 \ldots p_N)$.

Each iteration relating to the (m,n)-th
transcription $[(g_1 \ldots g_m) \mid (p_1 \ldots p_n)]$
establishes a link between the pair $(g_m, p_n)$ and the
pair with the highest of the three probabilities
determined beforehand for the three pairs
$(g_{m-1}, p_n)$, $(g_m, p_{n-1})$ and $(g_{m-1}, p_{n-1})$.
The link is stored in the computer. If the pair
$(g_m, p_n)$ is linked to the pair $(g_{m-1}, p_n)$, it is
an elementary transcription from $(g_{m-1}, g_m)$ to
$p_n$; if the pair $(g_m, p_n)$ is linked to the pair
$(g_m, p_{n-1})$, it is an elementary transcription from
$g_m$ to $(p_{n-1}, p_n)$; if the pair $(g_m, p_n)$ is
linked to the pair $(g_{m-1}, p_{n-1})$, it is an
elementary transcription from $g_m$ to $p_n$.

Thus a link is stored in the computer for each
determination of a probability
$P(g_1 \ldots g_m \mid p_1 \ldots p_n)$. The links trace a
single path that is also stored progressively in the
computer and links the first pair $(g_1, p_1)$ to the last
pair $(g_M, p_N)$ in the matrix with M columns and N rows.
The topology of the single path in the MxN matrix
segments the graphic chain CG into graphemes and the
phonetic chain CP into phonemes and aligns the graphic
elements and the phonetic elements in biunivocal
correspondence. If a segment of the path follows a
portion of a row between two graphic elements, the
concatenation of the graphic elements of that row portion
corresponds to the phonetic element of the row completed
by one or more mute and meaningless phonetic elements, in
order to form a grapheme and phoneme pair that has the
same number of elements and is stored in the computer. If
a segment of the path follows a column portion between
two phonetic elements, the graphic element of the column
plus one or more meaningless graphic elements corresponds
to the concatenation of the phonetic elements of that
column portion, in order to form a grapheme and phoneme
pair that has the same number of elements and is stored
in the computer. A change of direction of the path in the
matrix towards the horizontal, the vertical or the
diagonal indicates segmentation of the chains CG and CP.
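
The construction of the matrix of second probabilities and of the links can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the implementation of the invention: the function name `align_path`, the dictionary `first_prob` keyed by (graphic element, phonetic element), the value 0.0 for missing pairs, and the tie-breaking order (left, then upper, then diagonal neighbour) are all choices made for the example.

```python
def align_path(cg, cp, first_prob):
    """Fill the M x N matrix of second probabilities and follow the links.

    cg, cp     : graphic and phonetic chains (strings or symbol sequences)
    first_prob : dict mapping (graphic element, phonetic element) to a first
                 probability; a missing pair counts as 0.0 (assumption)
    Returns (matrix, path): matrix[n][m] is the second probability for the
    first m+1 graphic and n+1 phonetic elements; path lists 0-based (m, n)
    index pairs from the first pair to the last pair.
    """
    M, N = len(cg), len(cp)
    if M == 0 or N == 0:
        return [], []
    matrix = [[0.0] * M for _ in range(N)]
    link = [[None] * M for _ in range(N)]

    for n in range(N):                      # row by row
        for m in range(M):
            # Predecessors already computed: left, upper, diagonal.
            preds = []
            if m > 0:
                preds.append((m - 1, n))
            if n > 0:
                preds.append((m, n - 1))
            if m > 0 and n > 0:
                preds.append((m - 1, n - 1))
            best = 1.0                      # first pair has no predecessor
            if preds:
                # Store the link to the predecessor of highest probability.
                link[n][m] = max(preds, key=lambda c: matrix[c[1]][c[0]])
                pm, pn = link[n][m]
                best = matrix[pn][pm]
            matrix[n][m] = first_prob.get((cg[m], cp[n]), 0.0) * best

    # Walk the stored links back from the last pair to the first pair.
    path, cell = [], (M - 1, N - 1)
    while cell is not None:
        path.append(cell)
        cell = link[cell[1]][cell[0]]
    path.reverse()
    return matrix, path
```

Storing one link per cell keeps the backward walk linear in M + N, since each cell points to exactly one predecessor.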

A simple example concerns segmenting the
global transcription of the word CG = "beau" into the
phonetic chain CP = [bo], on the assumption that the step
E2 estimated the following first individual probabilities
in the corpus C:

P(b|b)=0.9 ; P(e|b)=0.1 ; P(a|b)=0.1 ; P(u|b)=0.1
P(e|o)=0.2 ; P(a|o)=0.1 ; P(u|o)=0.2 ; P(b|o)=0.1.

For the transcription (beau|bo) from the corpus,
the M=4 iterations of the steps E5, E6 and E7 for each of
the N=2 rows of the 4x2 matrix produce the following
table:

p_n / g_m    b = g_1    e = g_2    a = g_3    u = g_4
[b] = p_1    0.9        ← 0.09     ← 0.009    ← 0.0009
[o] = p_2    ↑ 0.09     ↖ 0.18     ← 0.018    ← 0.0036

The symbol ← indicates that the pair $(g_m, p_n)$ is
linked to the pair $(g_{m-1}, p_n)$; the symbol ↑
indicates that the pair $(g_m, p_n)$ is linked to the pair
$(g_m, p_{n-1})$; and the symbol ↖ indicates that the pair
$(g_m, p_n)$ is linked to the pair $(g_{m-1}, p_{n-1})$.
The symbol ↖ associated with the transcription (be|bo)
indicates that the latter has been derived from, and is
therefore linked to, the preceding transcription (b|b).
The symbol ↖ indicates a segmentation boundary between
grapheme and phoneme pairs. The following alignment is
derived from this table:

b  eau
b  o**.

The symbol * designates a mute and meaningless phonetic
element.
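
Deriving such an alignment from a stored path can be sketched in Python as follows. The function name `segment`, the 0-based (m, n) index pairs and the rule that only a diagonal step opens a new grapheme/phoneme pair are simplifying assumptions of this sketch; the description above states only that a change of direction of the path indicates segmentation.

```python
def segment(cg, cp, path, pad="*"):
    """Cut an alignment path into grapheme/phoneme pairs of equal length.

    path: list of 0-based (m, n) index pairs from the first pair to the
    last pair. A diagonal step opens a new pair; a horizontal step adds a
    graphic element and a vertical step adds a phonetic element to the
    current pair (simplifying assumption of this sketch).
    """
    pairs = []
    prev = None
    for m, n in path:
        if prev is None or (m != prev[0] and n != prev[1]):
            pairs.append([[cg[m]], [cp[n]]])   # new grapheme/phoneme pair
        elif m != prev[0]:
            pairs[-1][0].append(cg[m])         # horizontal step along a row
        else:
            pairs[-1][1].append(cp[n])         # vertical step along a column
        prev = (m, n)

    # Pad the shorter side with the mute element so both sides match.
    result = []
    for graph, phon in pairs:
        width = max(len(graph), len(phon))
        graph = graph + [pad] * (width - len(graph))
        phon = phon + [pad] * (width - len(phon))
        result.append(("".join(graph), "".join(phon)))
    return result


# Path of the (beau|bo) table: (b,[b]) -> (e,[o]) -> (a,[o]) -> (u,[o]).
print(segment("beau", "bo", [(0, 0), (1, 1), (2, 1), (3, 1)]))
# prints [('b', 'b'), ('eau', 'o**')]
```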

To refine the matches between graphemes and
phonemes and the matches between graphic elements and
phonetic elements, preferably in the manner indicated by
the step E11, the first probabilities $P(g_i|p_1)$ to
$P(g_i|p_J)$ of the transcriptions of each of the graphic
elements respectively into the J phonetic elements (step
E2), and in particular the contingency numbers
$K_{g_i p_1}$ to $K_{g_i p_J}$ (substep E25), are estimated
again as a function in particular of the ranks of the
phonetic elements placed in the given phonetic chains CP
that were segmented into phonemes in the preceding step
E10. Second probabilities
$P(g_1 \ldots g_m \mid p_1 \ldots p_n)$ of MxN second
transcriptions of each global transcription of a given
graphic chain (CG) with M graphic elements into a
corresponding phonetic chain (CP) with N phonetic
elements are then determined by executing the steps E3 to
E10 again, in order for links to be established in the
next step E10 between pairs $(g_m, p_n)$ of a new matrix
with M columns and N rows and consequently for a
corrected path to link the last pair $(g_M, p_N)$ to the
first pair $(g_1, p_1)$ in the new MxN matrix of second
probabilities.

Thanks to the processing capacity and high 
processing speed of the computer, other iterative loops 
of steps E2 to Ell may be executed in the computer until 
the matching process converges, i.e. until the path 
established becomes constant from one loop to the next. 
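
The outer loop over the steps E2 to E11 can be sketched in Python as follows. The function `iterate_until_stable`, its callable parameters `estimate` and `align`, and the safety bound `max_loops` are hypothetical names introduced for this sketch; the description itself states only that the loops are repeated until the established path becomes constant.

```python
def iterate_until_stable(corpus, estimate, align, max_loops=20):
    """Alternate estimation and alignment until the paths stop changing.

    corpus   : list of (graphic chain, phonetic chain) pairs
    estimate : callable(corpus, paths) -> first probabilities; paths is
               None on the first loop (initial estimation of the step E2)
    align    : callable(cg, cp, probs) -> alignment path for one pair
    max_loops: safety bound added for this sketch (assumption)
    Returns (probs, paths, number of loops executed).
    """
    probs, paths = None, None
    for loop in range(1, max_loops + 1):
        probs = estimate(corpus, paths)                  # E2, then E11
        new_paths = [align(cg, cp, probs) for cg, cp in corpus]  # E3 to E10
        if new_paths == paths:     # every path constant from loop to loop
            return probs, paths, loop
        paths = new_paths
    return probs, paths, max_loops
```

Passing the estimation and alignment routines as parameters keeps the convergence test independent of how the contingency numbers are recounted from the segmented chains.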

After segmentation of all the graphic and phonetic
chains of the corpus C into graphemes and phonemes, the
database stores all matches between graphic and phonetic
elements and all matches between graphemes and phonemes
for the whole of the processed corpus C.

Any new graphic chain added to the corpus can then 
be transcribed automatically into a phonetic chain 
segmented into phonemes, in particular with the aid of 
the matches previously established and stored in 
accordance with the invention, which progressively 
enriches the corpus in the database and increases 
transcription accuracy. 

As already stated, the phonetic transcriptions are 
useful to orthographic error correction software systems 
that recognize lexical errors when entering text on a 
terminal keyboard. Thus when the new graphic chain added 
to the corpus is being entered on a terminal keyboard, 
the phonetic chain segmented into phonemes by means of 
the stored matches is used for orthographic correction of 
the new graphic chain entered. 

The method of the invention may equally well be
used as a tool for automatically generating SMS short
messages from a text written in ordinary language. This
requires a training corpus C whose transcriptions
are adapted to the automatic generation of SMS
messages and respectively match graphic chains CG, such
as words and phrases, to phonetic chains CP whose
"phonemes" are phonetically readable by any person who is
not an expert in phonetics. For example, the corpus
establishes the following matches (in French) between
graphic chains and phonetic chains:



j'ai     : G
air      : R
occupé   : OQP
cas      : K.
Thus a new graphic chain entered in a terminal is
automatically transcribed by the method of the invention,
by means of the stored matches, into a phonetic chain
segmented into phonemes that can be read by any person
who is not an expert in phonetics and that is to be
included in an SMS message. In the foregoing example, the
French phrase "j'ai l'air occupé" entered on the terminal
is transcribed automatically into the following short
message to be transmitted by the terminal: Gl'ROQP, the
"phonetic chains" [G], [l'], [R] and [OQP] being
phonetically readable by any user who is not an expert in
phonetics. Alternatively, the phonetic chains [G], [l'],
[R] and [OQP] may be treated as phonetic elements to
constitute a phonetic chain [Gl'ROQP].
The steps of a preferred embodiment of the method
of the invention are determined by instructions of a
computer program incorporated into a computer such as a
terminal, a personal computer, a server or any other
electronic data processing system. The program
automatically matches graphic elements constituting given
graphic chains to phonetic elements constituting
corresponding phonetic chains, after initially entering
global transcriptions of the graphic chains into the
phonetic chains into a database accessible to the
computer and estimating and storing in the database first
probabilities of elementary transcriptions of graphic
elements into respective phonetic elements. The program
includes program instructions which execute the steps of
the method of the invention when said program is loaded
into and executed in the computer, the operation of which
is then controlled by executing the program.

Consequently, the invention applies equally to a
computer program adapted to implement the invention, in
particular a computer program on or in an information
medium. This program may use any programming language and
take the form of source code, object code or an
intermediate code between source code and object code,
such as a partially compiled form, or any other form that
may be desirable for implementing the method of the
invention.